Nanopore sequencing of a monkeypox virus strain isolated from a pustular lesion in the Central African Republic

Monkeypox is an emerging and neglected zoonotic disease whose number of reported cases has been gradually increasing in Central Africa since 1980. This disease is caused by the monkeypox virus (MPXV), which belongs to the genus Orthopoxvirus in the family Poxviridae. Obtaining molecular data is particularly useful for establishing the relationships between the viral strains involved in outbreaks in countries affected by this disease. In this study, we evaluated the use of the MinION real-time sequencer as well as different polishing tools on MinION-sequenced genome for sequencing the MPXV genome originating from a pustular lesion in the context of an epidemic in a remote area of the Central African Republic. The reads corresponding to the MPXV genome were identified using two taxonomic classifiers, Kraken2 and Kaiju. Assembly of these reads led to a complete sequence of 196,956 bases, which is 6322 bases longer than the sequence previously obtained with Illumina sequencing from the same sample. The comparison of the two sequences showed mainly indels at the homopolymeric regions. However, the combined use of Canu with specific polishing tools such as Medaka and Homopolish was the best combination that reduced their numbers without adding mismatches. Although MinION sequencing is known to introduce a number of characteristic errors compared to Illumina sequencing, the new polishing tools allow a better-quality MinION-sequenced genome, thus to be used to help determine strain origin through phylogenetic analysis.

Monkeypox is an emerging and neglected disease of zoonotic origin that presents with maculopapular rashesparticularly on the palms of the hands and soles of the feet-sometimes very similar to those of smallpox 1,2 . Infection can also be associated with adenopathy. This infection is caused by the monkeypox virus (MPXV), which belongs to the genus Orthopoxvirus in the family Poxviridae. This large virus, whose genome is around 200 kb, was first isolated in 1958 from a monkey (Macaca fascicularis) originating from Singapore and imported to Copenhagen, Denmark that had caused an outbreak in captive Cynomolgus monkeys 3 . However, the precise animal reservoirs of MPXV have yet to be identified, although this virus was isolated once from a symptomatic squirrel (Funisciurus anerythrus) caught in 1985 in the Democratic Republic of Congo (DRC) near a village where a human case had been reported previously and once from a sooty mangabey (Cercocebus atys) in 2012 in Ivory Coast 4,5 . Although orthopoxvirus antibodies are not specific to MPXV, they have been detected in a large number of animal species living in Africa, including numerous non-human primates and rodents, suggesting the OPEN 1

Materials and methods
Organization of suspected case notification, collection of biological samples, DNA extraction of the monkeypox virus and molecular assays. Whenever a case of monkeypox is reported in CAR, standardized data collection procedures have been developed and validated by the Ministry of Health and the World Health Organization (WHO). These procedures consist of notification of suspected cases, collection of biological samples for diagnosis and data collection in the field by the investigation team. The biological samples were then sent to the IPB in the best possible storage conditions for virological investigations. Although independent samples of several pustules were collected, DNA extraction was performed on a single pustular lesion using the QIAamp DNA Mini kit, according to the manufacturer's instructions. Extracted DNA was quantified using a Qubit dsDNA High-Sensistivity Assay kit (Invitrogen) following the protocol according to the manufacturer's guidelines and stored at -20 °C until use in molecular investigations. MPXV was detected using a quantitative polymerase chain reaction (PCR), as previously described 32,33 . MinION library preparation and sequencing. Barcoded sequencing libraries were prepared from 400 ng DNA using the Rapid Barcoding Sequencing kit SQK-RBK004 (Oxford Nanopore Technologies) following the manufacturer's protocol. The optional clean-up steps using AMPure XP beads (Beckman Coulter) were carried out to increase throughput. The library was loaded onto an R9. 4  Bioinformatics analyses. Unlike the previous steps which were carried out at the Institut Pasteur of Bangui (IPB), the following steps were carried out at the Institut Pasteur in Paris after the transfer of the raw data acquired in real time in files containing 4000 sequences in FAST5 format. These files were basecalled and demultiplexed using Guppy (version 3.4.1) (Fig. 1). Microbial diversity was determined by using (1) the Kraken2 taxonomic classifier based on a custom extended RefSeq database (containing the RefSeq reference libraries for viral, bacterial, fungal, archaea, protozoan and plasmid genomes/proteins, the human GRCh38 human genome/proteins, as well as the NCBI non-redundant nucleotide database containing sequences from large environmental sequencing projects), and (2) Kaiju, a protein-level classifier, using an equivalent protein database [34][35][36] (Fig. 1). The taxonomic classification of raw reads results were visualized on the Krona web interface 37 . The identification of reads corresponding to MPXV were confirmed by alignment with a reference MPXV (NC_003310) using Minimap2 (version 2.9) for consensus sequence process, after which the called bases were corrected, trimmed, assembled and read-corrected using Canu (version 1.8) 31,38 . A caveat of long-read technology is the introduction of sequencing errors (mostly short indels), which was thus verified and corrected by polishing using HomoPolish (Release v0.3) (Fig. 1). The workflow used for the detection and annotation of variants in the MinION dataset consisted of an alignment of the reads using Minimap2 (version 2.9) 38 to the reference sequence of the same virus (GenBank Accession MN702446) obtained by Illumina sequencing and assembled using SPAdes (version 3.10). Then, from this alignment, the list of SNPs and indels was obtained using Freebayes (version 1.1) 39      Orthopoxvirus by Kaiju were assigned respectively as an unassigned orthopox or another orthopox species such as cowpox ( Table 1). Examination of the taxonomic assignments obtained from these two classifiers showed that the total reads assigned to the family of Poxviridae was 2223 reads, of which 2140 reads were common to both classifiers. However, some reads were assigned to the family Poxviridae by only one classifier. Analyses of the taxonomic classification results showed that 28 reads were assigned by Kraken2 only, while 55 reads were assigned by Kaiju only. In parallel with the use of these classifiers, mapping all 146,920 raw reads using Minimap2 on a reference MPXV sequence (NC_003310) identified 2171 matching reads, representing about 1.47% of the total reads. Finally, the size of these reads varied between 168 bp and 44 kb for the largest reads.

Comparison of Illumina-and MinION-sequenced genomes. All 2171 MinION reads correspond-
ing to MPXV produced a sequence (MPXV-M) of 196,956 bp with a depth ranging from 12× to 57×. In 94% of the nucleic base positions, the depth was between 30× and 59× with an average depth of 39.72×. Canu was used to both assemble reads classified as belonging to the MonkeyPox genome into draft genomes and then to also remove random sequencing errors. Comparison with the Illumina sequence showed that the two sequences aligned well, but that the MPXV-M sequence was longer than the one initially obtained with the Illumina data (MPXV-I-MN702446), which had a length of only 190,357 bp. The MPXV-I sequence begins at base 573 and ends at base 190,634 with respect to the MPXV-M sequence. However, a retrospective analysis of raw Illumina data found a total of 5924 reads that matched the two 'missing ends' of the MPXV-I sequence, even though the majority of these mapped to the terminal repeated regions of both ends of the genome (data not shown). The comparison between the MPXV-I sequence (MN702446) and the MPXV-M sequence revealed insertions or deletions located in homopolymers of at least 6 bases that have been identified in this genome (Table S1 and Fig. 3). However, the major remaining differences between the two genomic sequences were between positions 178,570 and 178,664 with MPXV-M as the reference sequence. Detailed analyses of these differences showed that these 95 bp of the MPXV-I sequence actually corresponded to a region between bases 196,170 and 196,265 of the MPXV-M sequence. However, retrospective analysis using the raw data from Illumina sequencing showed that no reads corresponding to these 95 bp from the region between 178,570 and 178,664 were available. Finally, analysis of the reference sequence (KP849471) that was used for probe design for our targeted enrichment also did not contain this 95 bp region. The absence of this 95 bp region in this reference sequence explains the absence of a capture probe targeted in this genomic region and consequently the absence in our Illuminasequenced genome (MPXV-I).
In addition to those differences, the central coding region sequence (CRS; roughly between 56,000-120,000) in the MPXV genome is highly conserved and is flanked by variable ends that contain inverted terminal repeats (ITRs). Those ITR's comprise as much as 1% of the genome, are prone to hair-pin loop-outs 46,47 and contain at least 4 ORFs 48,49 . As such, the ITR regions represent global repeats (i.e., long sequence which is duplicated throughout the genome) and contain local repeated sequences. In the MPXV genome, local tandem repeats are found both within the ITR regions and outside these regions. In contrast to global repeats, the local repeat contains simple sequences, that are tandem duplicated, and therefore pose a special challenge to genome-assembly tools. The GenBank reference MN702446 was obtained using an Illumina short-read sequencing approach. The analysis of this sequence using Tandem Repeat Finder (TRF; 50 ) showed that the left-most ITR sequences could be identified, but showed that the corresponding right-most ITR is missing in the final sequence. Figure 4 shows the tandem repeat locations on the x-axis in the MPXV genome, and the y-axis shows their period size (green bars) and the length (red bars) of the tandem repeat region. This in turn highlights the shortcomings of the assembly of such a repeat-rich sequencing dataset based on short-read sequencing technology. The sequence obtained in www.nature.com/scientificreports/ the present study, using long-read MinIon technology, analyzed using TRF with the same parameters, showed that the repeats could also be identified in the right-most ITR sequence Fig. 4 shows the equivalent Tandem Repeats found by TRF for the Illumina sequence (Genbank MN702446). The long reads spanning both ITRs could in fact more easily discriminate both ITR's and allow to finish the genome. However, despite the presence of these differences in the MPXV-M sequence compared with the MPXV-I sequence, phylogenetic analyses showed that the MPXV-M sequence is positioned in the CAR1 lineage of group II strains belonging to the Central African clade. The position of the MPXV-M sequence, along with the Bao (A4 and A5) and Bangassou (B2) sequences of 2016 and 2017, respectively, is identical to the phylogenetic analyses previously performed in a previous study using all CAR MPXV sequences obtained with Illumina sequencing 19 (Fig. 5).

Evaluation of polishing tools on MinION-sequenced genome.. A major issue in nanopore
sequencing is the basecalling in homopolymer-rich regions. Basecallers often tend to collapse homopolymers into shorter stretches if the homopolymer length exceeds the number of bases simultaneously influencing the measured current in the nanopore-device, resulting in a higher deletion rate in nanopore reads. As a consequence, the homopolymer containing stretches need to be checked with a dedicated polisher to increase accuracy in downstream steps of the genome finishing process. Therefore, basecalled reads are often assembled into a consensus sequence after which they are mapped back to the assembly to improve the consensus by so-called polishers. Although Canu in fact by default performs a read-correction step, dedicated polishing tools are used in combination with Canu, such as Medaka, Racon or Homopolish, alone or in combination. In order to remove the remaining systematic errors, the efficiency of the polishing operation was evaluated using the sequence for this genome, obtained by Illumina-sequencing, as a reference (MN702446). The different combinations of polishing tools show differences in the number of indels corrected. Indeed, the different combinations of Canu for assembly with polishing tools, such as Canu/Homopolish, but also Canu/Medaka/Homopolish and Canu/ Racon/Medaka/Homopolish, give a better correction of indels compared to the other combinations tested. Indeed, the number of indels compared to the use of Canu alone is reduced. It is reduced from 433 to a number varying from 156 to 159 ( Table 2). The other combinations tested, such as Canu/Medaka, Canu/Racon or Canu/Racon/Medaka, improve the error correction compared to Canu alone, but less efficiently. Indeed, there are still between 236 and 375 indels depending on the combinations used (Table 2). Even if the assembly performed with Canu is not satisfactory compared to the combinations with the other tools in terms of indels (433 compared to 156 to 375), no mismatch was observed compared to the reference sequence. In contrast, the  (Table 2). These mismatches are relatively rare, although they occur in regions of relatively constant sequencing depth (around 70×). The 2 mismatches resulting from polishing with Homopolish, which are 2 A-T transversions, occur in the first half of the genome (about 20-26%, in genome coordinates), while the singular mismatches occurring using Racon, which are A-C transversions, occur in the second identical half of the genome (about 83%, in genome coordinates), and are thus apparently the same SNP  www.nature.com/scientificreports/ events. The combinations tested also have variable efficiency at homopolymeric regions of size 9. Indeed, the Canu/Homopolish combination provides the best accuracy (87.5%) in the polishing procedure in these regions (Fig. 6). Moreover, it is significantly better than the Canu/Medaka combination or Canu/Racon alone (Fig. 6). In contrast, the combination of Medaka with Homopolish did not provide any improvement in error correction in these homopolymeric regions. Furthermore, the genomic coordinates of the homopolymeric sequences appear to be distributed over the entire length of the final assembly after polishing, and are therefore not restricted to the beginning and end of the assembly as illustrated with Figs. 7 and 8, showing all polymeric sequences and aberrations in the length of the deleted polymeric sequences with the precise genomic location of the remaining polymeric sequences. In conclusion, polishing of the raw and corrected reads improved the final assembly, most essentially using the Homopolish tool, notable through the Q-values that are maxing out to 30 Phred value,

Discussion
In this study, the portable real-time ONT sequencer (MinION) was used for the first time to target and sequence the full MPXV genome from DNA isolated from a human clinical sample in the CAR. Genomic data acquisition and analysis, whether for SGS or TGS, depend on a number of computer-intensive steps. After acquisition, basecalling is the initial step and is ideally performed on a dedicated computer system, given the large amount of data and metadata generated. ONT offers real-time basecalling using cloud services. However, even today, field laboratories in resource-constrained or remote locations often lack stable internet connections, making downstream analyses problematic and severely hindering sample identification in the field. Thus, for field studies, it is necessary to perform private, offline analyses of MinION data. The necessary tools include software for diversity analysis and for classification to rapidly identify species. The most common tools for MinION TGS sequences are carried out using sequence classification tools such as Centrifuge, Kraken2 and Kaiju; reference mapping tools such as MiniMap2 and BWA; and a set of variant analysis tools. Kraken2 was selected among all available DNAto-DNA classifiers (Kraken2, KrakenUniq, k-SLAM, MegaBLAST, metaOthello, CLARK, CLARK-S, GOTTCHA, TaxMaps, Prophyle, PathSeq, Centrifuge and Karp), because it was available on a dedicated Galaxy platform with an appropriate database (MiniKraken) 35,36,[51][52][53][54][55][56][57][58][59][60] . In addition, Kraken2 has good performance footprint measures and is very fast on a large number of samples. From the available DNA-protein classifiers (DIAMOND, Kaiju and MMseqs2), Kaiju was chosen because it generally has a much faster classification speed and lower memory footprint requirements than the DIAMOND and MMseqs2 classifiers, without compromising performance 34,61,62 .
Most state-of-the-art tools, such as Kraken2 and Kaiju are designed for short reads. They use pseudo alignments, i.e., the exact or approximate matches from reads to a reference as signals to perform classification. Kraken2 utilizes spaced seeds in the storage and querying of minimizers to improve classification accuracy. To run efficiently, Kraken2 requires enough free memory to hold the database (primarily the hash table) in RAM (default database size, 29 GB). The custom extended RefSeq database we used (containing the RefSeq reference libraries for viral, bacterial, fungal, archaea, protozoan and plasmid genomes/proteins, the human GRCh38 human www.nature.com/scientificreports/ genome/proteins, as well as the NCBI non-redundant nucleotide database containing sequences from large environmental sequencing projects) required a server that had at least 60 GB of RAM, which is currently fairly common. Kraken2 was used in multi-threading mode, and typically required (at the time of writing) approximately 50 min to process a typical sequencing run. For Kaiju, the reference index was built using an equivalent NCBI protein database due to its specifically designed read classification approach, and required approximately 40 GB of RAM. Run times were slightly increased compared with Kraken2, on a comparable number of CPUs. However, given that the two selected classifiers (Kraken2 and Kaiju) operate on different principles, it was almost impossible to compare the accuracy of the classification of each method without using another reference method. Kraken uses k-mers of fixed but variable length identified in the reads. Then, it matches them to the k-mers as defined in its index constructed from reference genomes. Kaiju finds the maximum number of exact matches between the reads and the indexed database. These differences mean that Kaiju should be able to classify more reads, but with less accuracy than Kraken2 63 . Likewise, Kaiju classified 36,102 reads as bacteria, whereas Kraken2 classified only 1423 as bacteria. However, the number of reads classified by both methods as belonging to the Poxviridae family was similar (2168 versus 2195, respectively, for Kraken2 and Kaiju). Although a few reads were specifically assigned to this family by either method, Kraken2 did not identify any other viral reads, unlike Kaiju, which identified 127. Despite these small differences in the results of viral read classification, both classifiers gave similar results in more than 98% of the cases. Although BLASTN alone would be more efficient compared with either Kraken2 and Kaiju, its use in field conditions is not possible 34 . Finally, the combination of results obtained with these two classifiers compensated for the lack of precision and the lack of sensitivity by making it possible to identify all the viral reads corresponding to MPXV and to use them for further analysis.
Although this TGS type of technology is known to have a higher error rate compared with SGS on the Illumina platform, few errors were detected in the genome sequence of our MPXV strain. The difference in genome size obtained using the MinION and Illumina sequencing approaches was significant given that the assembly based on the short reads obtained with Illumina was suboptimal in terminal repeat regions. Therefore, the generation of longer reads (with the MinION) provides an advantage in closing the gap on repeat regions such as terminal repeats that are more difficult to assemble with a short-read sequencing approach. In addition, TGS was able to ' correct' a region that had not been correctly sequenced and assembled from the data generated by Illumina sequencing. A study on an old African swine fever virus (ASFV) strain from 2007 resequenced using TGS also improved the length of the inverted terminal repeat sequences compared with the genome obtained previously with SGS (Illumina) 64 . In addition, a major issue in nanopore sequencing is the basecalling in homopolymer-rich regions. Basecaller can accurately call homopolymers up to six bases in length, but without significant improvement for longer homopolymers. Current basecallers often include homopolymer correction settings, although the effect appears to be small, also in our dataset. Therefore, in order to resolve mismatched stretches related to homopolymer sequences, polishers such as Nanopolish and HomoPolish propose to map the reads back to the assembly in order to improve the assembly-consensus, which after polishing is in general very high, but depending on the genome size, it can be a computationally expensive process. As in this study, the 71 detected errors were homopolymers 64 . Polishing the Canu assemblies using NanoPolish (v 0.13.2) is recommended to improve the accuracy of the sequence. Unfortunately, here, polishing did not improve either the final assembly or the accuracy of the final sequence, due to the mapping-based read ordering described above, which was primarily intended to facilitate and guide the assembly process. In contrast, HomoPolish (https:// github. com/ ythua ng0522/ homop olish) 65 , a tool relying on a machine-learning model, trained to correct systematic errors that occur in Nanopore sequencing, corrected a number of mismatches and provided a polished version of the assembly compared with the original reference sequence (NC_003310). However, using the reference sequence (obtained with Illumina) did not correct the errors observed in the homopolymers. In addition, using an alternative assembly method based on SPAdes (v. 3.10) to check the robustness of the assembly approach did not produce better assembly results (data not shown).
Despite the differences observed in the homopolymer regions when sequencing the MPXV and ASFV genomes compared with SGS, TGS identifies and characterizes these viral genomes much faster than SGS, and allows to more clearly discriminate ITR sequences, which in turn allows to obtain a longer finished assembly. With SGS, the raw data can only be analyzed after sequencing has been completed, but the MinION data can be analyzed in real time. For example, the ASFV virus was identified as early as 6 min after the start of sequencing and the whole genome within hours 66 . Many examples of the use of TGS have shown that real-time data analysis is the main advantage of this technology over SGS technologies. Rapid identification is a major advantage, especially for viruses that spread very quickly and have serious life-threatening consequences. Although TGS is not the most suitable tool for rapid differential diagnosis of MPXV to distinguish it from Varicella-Zoster virus, its main advantage resides in the very fast identification of the origin of the strain. Our phylogenetic analysis showed that the MPXV-M sequence was sufficiently precise to position it correctly in the CAR1 lineage 19 . Molecular data obtained with TGS makes it possible to identify the relationships between detected cases more rapidly, especially those identified in the same city within a very short timeframe. For instance, in the city of Rafai, several cases had been reported in the space of a few weeks between March and April 2018 67 , but only phylogenetic analyses were able to resolve the specific relationship between them 19 . Similarly, early differential diagnosis of MPXV, distinguishing it from Varicella-Zoster virus, and as close as possible to the index case can help limit its spread to the immediate environment. Secondary cases frequently occur either among family members or among staff at the healthcare facility where the cases are treated. Rapid identification of monkeypox within a few hours can facilitate the isolation of the patient, where possible, and the increased use of personal protective equipment to prevent the spread of the disease. Even though the use of MinION technology is relatively simple and compatible with field use, the handling of biological samples (blood, scabs or pustular lesions, etc.) where a potentially highly contagious agent is present may expose the investigator, without adequate protective equipment, to infection until the sample under investigation is fully inactivated. www.nature.com/scientificreports/ In conclusion, this study confirms the usefulness of MinION technology for sequencing the genome of an MPXV virus in the context of an outbreak. Here, we show that the data obtained from directly sequencing DNA extracted from a lesion is sufficient to obtain the complete genome of the virus. The quality of the sequence obtained is suitable to provide information on the origin of the virus with sufficient accuracy despite minor errors related to the acquisition of reads observed in the homopolymeric regions.

Data availability
The corresponding MinION raw data are available under the BioProject ID (PRJNA762014).